Summary of MLMT


Table of Contents


The Learning Theory

Essence of Machine Learning

Learning model = Hypothesis Set + Learning Algorithm

A Simple Hypothesis Set (Perceptron)

A Simple Learning Algorithm (PLA)
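A minimal PLA sketch on linearly separable data (illustrative NumPy; the bias is folded into the weight vector, and the function name is mine):

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm on linearly separable data.

    X: (N, d) inputs, y: (N,) labels in {-1, +1}.
    Returns weights w with the bias folded in as w[0].
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iters):
        preds = np.sign(Xb @ w)
        misclassified = np.where(preds != y)[0]
        if len(misclassified) == 0:
            return w                   # every point correct: converged
        i = misclassified[0]           # pick any misclassified point
        w += y[i] * Xb[i]              # PLA update: w <- w + y_n * x_n
    return w
```

On separable data PLA is guaranteed to converge, but the number of iterations depends on the margin of the data.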

Hoeffding's Inequality

The probability that the difference between the sample estimate and the true value exceeds the tolerance is bounded from above, and the bound depends on the number of samples.
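Stated formally, for a single fixed hypothesis h with in-sample error E_in, out-of-sample error E_out, tolerance ε, and N samples:

```latex
\mathbb{P}\left[\,\lvert E_{\text{in}}(h) - E_{\text{out}}(h)\rvert > \epsilon\,\right] \;\le\; 2e^{-2\epsilon^2 N}
```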

If learning is feasible

Union Bound

Hoeffding's Inequality applies to a single fixed hypothesis, not to the whole hypothesis set (the final hypothesis is chosen after seeing the data).

Vapnik-Chervonenkis Inequality

Generalization:

Tradeoff


Machine Learning Algorithm

KNN

Linear Classifier

SVM

Hinge loss -> Multiclass SVM loss

Softmax

Multinomial Logistic Regression
interpret raw classifier scores as probabilities

Logistic Regression

Regularization

Decision Tree

Random Forest


Neural Network

Backpropagation

Why do we need Backpropagation?

Definition: recursively apply the chain rule along a computational graph to obtain the gradient

Motivation: it is not practical to derive the gradient formula by hand for every parameter of a large neural net

Implementations


Price of nonlinear transform?

Looking at the data before choosing the model can be hazardous to your E_out (data snooping)


Optimization

In practice: Always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check.

Gradient Descent

Gradient descent is an optimization algorithm that starts from a random point on the cost surface and iteratively travels down its slope in steps until it reaches a minimum of the cost (possibly only a local minimum)

step size = gradient * learning rate.

downside: gradient descent is slow on huge data.
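The update rule above can be sketched as (illustrative NumPy; `grad_f` stands for any function returning the gradient at a point):

```python
import numpy as np

def gradient_descent(grad_f, x0, lr=0.1, steps=100):
    """Vanilla gradient descent: x <- x - lr * grad_f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad_f(x)  # step size = gradient * learning rate
    return x
```

For example, minimizing f(x) = x^2 (gradient 2x) from x0 = 5 converges toward 0.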

Stochastic Gradient Descent (SGD)

SGD randomly picks one data point from the whole data set at each iteration, reducing the computation enormously.

Benefits

It is also common to sample a small number of data points instead of just one point at each step and that is called “mini-batch” gradient descent. Mini-batch tries to strike a balance between the goodness of gradient descent and speed of SGD.
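A mini-batch SGD sketch under those definitions (illustrative; `grad_fn` is any function computing the gradient on a batch):

```python
import numpy as np

def sgd(X, y, grad_fn, w0, lr=0.1, batch_size=32, epochs=20, seed=0):
    """Mini-batch SGD: each update uses a small random subset of the data."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    n = X.shape[0]
    for _ in range(epochs):
        idx = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w -= lr * grad_fn(w, X[batch], y[batch])
    return w
```

With batch_size = 1 this is plain SGD; with batch_size = n it is full-batch gradient descent.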

Problems with SGD

SGD + Momentum

The optimization process can then be seen as equivalent to the process of simulating a particle as rolling on the landscape, where the gradient only directly influences the "velocity" and in turn has an effect on the weight (x).
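One momentum update can be sketched as (a standard formulation; `rho` is the friction/decay hyperparameter):

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, rho=0.9):
    """SGD+Momentum: the gradient changes the velocity,
    and the velocity changes the position (w)."""
    v = rho * v - lr * grad   # velocity accumulates decayed gradients
    w = w + v
    return w, v
```

The running velocity damps oscillations across steep directions and builds up speed along consistent ones.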

Nesterov Momentum

A variant of Momentum: instead of evaluating the gradient at the current position, evaluate it at the "looked-ahead" position (current position plus velocity).

AdaGrad

Scale the gradient based on the accumulated sum of squares of the previous gradients.

but: the accumulated sum only grows, so the effective step size keeps shrinking and can stall progress

RMSProp

RMSProp introduces a decay rate into AdaGrad, so that the accumulated squared gradient focuses on recent gradients (the influence of an early large slope disappears after some steps)
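One RMSProp update, as a sketch (illustrative; AdaGrad would instead accumulate `cache += grad**2` with no decay):

```python
import numpy as np

def rmsprop_step(w, cache, grad, lr=0.01, decay=0.99, eps=1e-8):
    """RMSProp: keep a decayed running average of squared gradients
    and divide the step by its square root."""
    cache = decay * cache + (1 - decay) * grad**2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```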

Adam

Combine Momentum and AdaGrad/RMSProp

Bias correction for the fact that first and second moment estimates start at zero
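A sketch of one Adam update combining both ideas, including the bias correction (`t` is the 1-based step count):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum-style first moment + RMSProp-style
    second moment, each bias-corrected because both start at zero."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (RMSProp)
    m_hat = m / (1 - beta1**t)                # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Without the correction, m and v are biased toward zero for the first few steps, which would make the initial updates too small.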


Learning Rate Selection

more critical with SGD+Momentum, less common with Adam


Second-Order Optimization

In practice: use Adam; try L-BFGS if you can afford to do full-batch updates


Regularization for Deep Learning

pattern:

Training: add some kind of randomness
Test: average out randomness


Transfer Learning


Activation function

Sigmoid function

Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron

Downsides:

Tanh(x)

ReLU

upsides:

downsides:

Leaky ReLU

Exponential Linear Units (ELU)

but

Maxout “Neuron”

but

In practice


Data Preprocessing

In practice:


Weight Initialization

Zero Initialization

As its name suggests, this method sets all the weights to zero. It serves almost no purpose, because every neuron then performs the same calculation in each iteration and produces the same output.

If all the weights are initialized to zero, the derivatives remain the same for every w in a layer, so the neurons learn the same features in every iteration (they fail to break symmetry). And not only zero: any constant initialization produces a poor result.

If all the weights are initialized to zeros, the derivatives will remain same for every w in the layer. As a result, neurons will learn same features in each iterations(fail to break symmetry). And not only zero, any constant initialization will produce a poor result.

Random Initialization

It prevents neurons from learning the same features of their inputs, but it suffers from vanishing and exploding gradients when the neural network is deep (many layers).

Xavier Initialization (for tanh activation)

The goal of Xavier Initialization is to initialize the weights such that the variance of the activations are the same across every layer. This constant variance helps prevent the gradient from exploding or vanishing.

Xavier initialization sets a layer's weights to values drawn from a distribution with variance 1/fan_in (originally a bounded uniform distribution; a Gaussian version is shown below)

W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

Works for tanh and sigmoid, but not for ReLU (ReLU zeroes half of the activations, halving the variance at each layer)

He initialization (for ReLU)

W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in/2)


Batch Normalization

A layer in a deep neural network assumes the outputs of the prior layer come from a fixed distribution. But all layers are actually changing during the update (learning) process, so the update procedure is forever chasing a moving target.

Batch Normalization standardizes the activations of the prior layer within every single batch (unit-Gaussian activations). (Standardization/Normalization: rescaling data to have a mean of zero and a standard deviation of one.)

Assuming an s-shaped activation function like tanh or sigmoid, it is usually inserted after fully connected or convolutional layers, before the nonlinearity. But in practice, inserting it after the activation can work even better for activation functions that produce non-Gaussian outputs, like ReLU.

We may need to recover the identity mapping, with learnable scale and shift parameters γ and β: y = γx̂ + β (the network can learn γ = std and β = mean to undo the normalization if that is optimal).

This step differs between training and test time. Training: mean/std are computed from the current batch. Test: a fixed empirical mean/std of the activations, estimated during training with running averages, is used.
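A forward-pass sketch showing the train/test difference (illustrative; statistics are per feature, over the batch axis):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, momentum=0.9, eps=1e-5):
    """Batch norm forward pass: batch statistics at train time,
    running averages at test time; gamma/beta can recover identity."""
    if train:
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)   # unit-Gaussian activations
    out = gamma * x_hat + beta              # learnable scale and shift
    return out, running_mean, running_var
```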

but


Babysitting the Learning Process

Learning Rate adjusting

Check Initialization

A bad initialization makes the loss not decrease at first; then, after a while, it starts decreasing normally again.

Check accuracy

Check update ratio

better around 0.001 (updates about 1/1000 the magnitude of the weights)


Hyperparameter Searching

https://www.sicara.ai/blog/2019-14-07-determine-network-hyper-parameters-with-bayesian-optimization

Grid Search

The easiest way to search for a good combination of hyperparameters is to exhaustively try them all: evaluate every combination of the hyperparameter values the user defines, and select the best one.

but
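The exhaustive search can be sketched as (illustrative; `train_and_eval` stands for any routine that trains a model and returns a validation score):

```python
import itertools

def grid_search(train_and_eval, param_grid):
    """Try every combination in param_grid exhaustively.
    train_and_eval(params) -> validation score (higher is better)."""
    best_params, best_score = None, float("-inf")
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_eval(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Note that the number of combinations grows multiplicatively with each hyperparameter added.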

Random Search

Random search consists of sampling random values in the hyperparameter space.

but

Bayesian Optimization

View hyperparameter tuning as the optimization of a black-box function, where the objective is the validation error of a machine learning model trained with a given set of hyperparameters. Bayesian optimization constructs a probabilistic model of the learning algorithm's performance.

In Bayesian optimization, the performance function is modeled as a sample from a Gaussian process (GP) over the hyperparameter values. The posterior distribution of this function gives insight into the confidence of the function's value across the hyperparameter space. Once the probabilistic model is obtained: either sample near the highest observed value, where performance is likely to increase; or explore another subset of the hyperparameter space, where information is lacking and another maximum could be hiding. (exploitation and exploration trade-off)

we should choose the next point x where the mean is high (exploitation) and the variance is high (exploration)


Convolutional Neural Network

Layers and Structures

original image -> low-level features -> mid-level features -> high-level features -> linearly separable classifier

Convolutional Layer

Pooling Layer

Fully Connected Layer


LeNet-5

AlexNet

ZFNet

VGGNet

Why use smaller filters? (3x3 conv)

GoogLeNet

“Inception module”: design a good local network topology (network within a network) and then stack these modules on top of each other

ResNet

A deeper plain model performs worse on both training and test error (so the problem is not caused by overfitting)

Full ResNet architecture:

In practice:

Summary

More


Recurrent Neural Network

The same function and the same set of parameters are used at every time step!

Type and application

Truncated Backpropagation

Dealing with Gradient vanishing and exploding

Long Short Term Memory

Summary


Reinforcement Learning

Problems involving an agent interacting with an environment, which provides numeric reward signals

Goal: Learn how to take actions in order to maximize reward

Mathematical Formulation of RL

How to handle the randomness

Maximize the expected sum of rewards!

Policy Evaluation

Optimal Policy

Value iteration algorithm: Use Bellman equation as an iterative update

Bellman equation:

If the optimal state-action values for the next time-step Q*(s',a') are known, then the optimal strategy is to take the action that maximizes the expected value of r + γQ*(s',a')

The optimal policy π* corresponds to taking the best action in any state as specified by Q*
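The Bellman update above can be sketched as tabular value iteration on a toy MDP (illustrative array layout; my function and variable names):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=100):
    """Tabular value iteration on a small MDP.

    P[s, a, s'] = transition probability, R[s, a] = expected reward.
    Iterates Q(s,a) <- R(s,a) + gamma * sum_s' P(s,a,s') * max_a' Q(s',a').
    """
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = Q.max(axis=1)          # V*(s') = max_a' Q(s', a')
        Q = R + gamma * P @ V      # Bellman backup over all (s, a)
    return Q, Q.argmax(axis=1)     # optimal Q and the greedy policy
```

Iterating over every state-action pair like this is exactly what stops scaling when the state space is huge, hence the function-approximation solution below.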

Cons:

Solution: use a function approximator to estimate Q(s,a) (Neural Network)

Q-Learning

Use a function approximator to estimate the action-value function: Q(s,a;θ) approximates Q*(s,a). If the function approximator is a neural network, this is Deep Q-Learning.

Policy Gradient

Actor-Critic Algorithm

Actor (Policy Gradient -> policy) + Critic (Q-Learning -> Q-function)


Dimensionality Reduction

Goal of dimensionality reduction is to discover the axes of the data!

SVD

A = UΣVᵀ (unique)

Relation to Eigen-decomposition

Computation of SVD (1)

Computation of SVD (2)

How to reduce dimensionality?

Set smallest singular values to zero

How to decide the remaining dimensionality?

Pick r so the retained singular values have at least 90% of the total energy. (remember to square)
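That rule can be sketched as (NumPy; `energy` is the fraction of squared singular values to retain, and the function name is mine):

```python
import numpy as np

def choose_rank(A, energy=0.90):
    """Pick the smallest r whose top-r singular values retain `energy`
    of the total energy (sum of *squared* singular values)."""
    s = np.linalg.svd(A, compute_uv=False)   # singular values, descending
    cum = np.cumsum(s**2) / np.sum(s**2)     # cumulative energy fraction
    return int(np.searchsorted(cum, energy) + 1)
```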

Pros & Cons:

t-SNE (t distributed Stochastic Neighborhood Embedding)

t-SNE tends to preserve local structure while also preserving the global structure as much as possible. What t-SNE does is find a way to project data into a low-dimensional space so that the clustering of the high-dimensional space is preserved.

Drawbacks of SNE

Novel features in t-SNE


Unsupervised Learning

Autoencoder

An autoencoder is trained using the same images as both input and target, meaning that it learns to reconstruct its original inputs.

Why dimensionality reduction?

Want features to capture meaningful factors of variation in data

How to learn this feature representation?

Train such that features can be used to reconstruct original data

Different type

Additive Term to the Reconstruction Error

Instead of a fixed code, a VAE learns a mean and a variance in the latent space from the images (encoder network), assuming the images were generated by a statistical process. It uses the mean and variance to randomly sample an element from that distribution and decodes it back into an image (decoder network).

First, an encoder module turns the input samples input_img into two parameters in a latent space of representations, which we will call z_mean and z_log_variance.

Then, we randomly sample a point z from the latent normal distribution that is assumed to generate the input image, via z = z_mean + exp(z_log_variance) * epsilon, where epsilon is a random tensor of small values.

Finally, a decoder module will map this point in the latent space back to the original input image. Because epsilon is random, the process ensures that every point that is close to the latent location where we encoded input_img (z_mean) can be decoded to something similar to input_img, thus forcing the latent space to be continuously meaningful. Any two close points in the latent space will decode to highly similar images.
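The sampling step can be sketched as (a NumPy stand-in for the reparameterization trick; in a real VAE this runs inside the differentiable graph so gradients flow through z_mean and z_log_variance):

```python
import numpy as np

def sample_latent(z_mean, z_log_variance, seed=None):
    """Reparameterization trick: z = z_mean + exp(z_log_variance) * epsilon,
    where epsilon is external noise, keeping z differentiable w.r.t.
    the encoder outputs."""
    rng = np.random.default_rng(seed)
    epsilon = rng.normal(size=np.shape(z_mean))
    return z_mean + np.exp(z_log_variance) * epsilon
```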

Continuity, combined with the low dimensionality of the latent space, forces every direction in the latent space to encode a meaningful axis of variation of the data, making the latent space very structured and thus, highly suitable to manipulation via concept vectors.

The parameters of a VAE are trained via two loss functions:

GAN